Big Cities Health Data

Courtney Ferguson Lee

Introduction

This notebook looks for health trends in the Big Cities Health Inventory (BCHI). The BCHI is a compilation of health data spanning 53 indicators across 27 locations (major US cities plus a U.S. Total aggregate). The original dataset contains 17 variables and 18,329 records collected from various sources. The indicators fall into 11 categories:

  1. Behavioral Health and Substance Abuse
  2. Cancer
  3. Chronic Disease
  4. Demographics
  5. Environmental Health
  6. Food Safety
  7. HIV/AIDS
  8. Infectious Disease
  9. Injury and Violence
  10. Life Expectancy/Overall Death Rate
  11. Maternal and Child Health

During this exploration, we will reshape the data into 58 variables across 1814 records to allow for multivariate analysis.

Load Modules

In [2]:
# Data manipulation
import pandas as pd
pd.options.display.max_columns = 60
import numpy as np
from IPython.display import display

# Visualizations
import missingno as msno
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid", {'axes.grid': False})

# Machine learning
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

Load and Sample Data

The following copies the first 250 lines of the raw file (header included) into a small sample file that opens easily in Excel.

In [3]:
import unicodecsv

with open('data/bchi.csv', 'rb') as fin:
    freader = unicodecsv.reader(fin, delimiter=',')
    with open('data/bchi_sample.csv', 'wb') as fout:
        fwriter = unicodecsv.writer(fout, delimiter=',')
        count = 0
        for row in freader:            
            fwriter.writerow(row)
            count += 1
            if count==250:
                break
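
For reference, the same sampling step can be sketched with only the standard library. This is a minimal sketch, not the cell that was actually run: it assumes Python 3 and a UTF-8 encoded file, and `write_sample` is a hypothetical helper name.

```python
import csv
from itertools import islice


def write_sample(src_path, dst_path, n_rows=250):
    """Copy the first n_rows CSV records (header included) to dst_path.

    Hypothetical helper; the UTF-8 encoding is an assumption about the file.
    """
    with open(src_path, newline='', encoding='utf-8') as fin, \
         open(dst_path, 'w', newline='', encoding='utf-8') as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        # islice stops after n_rows records without reading the rest of the file
        writer.writerows(islice(reader, n_rows))
```

Calling `write_sample('data/bchi.csv', 'data/bchi_sample.csv')` would mirror the cell above. Going through `csv.reader` rather than copying raw lines keeps quoted fields that contain commas or line breaks intact.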
In [4]:
# read_csv already returns a DataFrame, so no extra wrapping is needed
health_df = pd.read_csv('data/bchi.csv')
In [5]:
location_df = health_df.copy()
location_df.loc[:,'City'] = location_df['Place'].str.split(', ').str.get(0)
location_df.loc[:,'State'] = location_df['Place']
location_df.loc[location_df['State']!='U.S. Total','State'] = location_df['Place'].str.split(', ').str.get(1)
location_df.to_csv('data/bhmi_locations.csv')

Data Structure

Shape and Basic Structure

In [6]:
health_df.shape
Out[6]:
(18329, 17)
In [7]:
health_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18329 entries, 0 to 18328
Data columns (total 17 columns):
Indicator Category                  18329 non-null object
Indicator                           18329 non-null object
Shortened Indicator Name            18329 non-null object
Shortened Indicator Name (Graph)    1643 non-null object
Year                                18329 non-null int64
Sex                                 18329 non-null object
Race/Ethnicity                      18329 non-null object
Value                               18329 non-null object
Place                               18329 non-null object
BCHC Requested Methodology          18329 non-null object
Source                              15907 non-null object
Methods                             5084 non-null object
Notes                               5054 non-null object
90% Confidence Level - Low          2359 non-null object
90% Confidence Level - High         2359 non-null object
95% Confidence Level - Low          2264 non-null object
95% Confidence Level - High         2264 non-null object
dtypes: int64(1), object(16)
memory usage: 2.4+ MB
In [8]:
health_df.describe(include=['O'])
Out[8]:
Indicator Category Indicator Shortened Indicator Name Shortened Indicator Name (Graph) Sex Race/Ethnicity Value Place BCHC Requested Methodology Source Methods Notes 90% Confidence Level - Low 90% Confidence Level - High 95% Confidence Level - Low 95% Confidence Level - High
count 18329 18329 18329 1643 18329 18329 18329 18329 18329 15907 5084 5054 2359 2359 2264 2264
unique 11 53 53 6 3 11 2915 27 322 640 254 211 583 654 817 948
top Chronic Disease Percent of Population Uninsured Population Uninsured Rate of Lab Confirmed Salmonella Infections Both All 0.0 U.S. Total All cancer mortality rate per 100,000 populati... DP05 - Demographic and Housing Estimates: ACS ... Rates per 100,000 calculated using the Minneap... Deaths for which cause was listed as "deferred... 0.1 0.3 4.7 13.4
freq 3150 706 706 503 12737 8107 253 1288 674 780 566 342 31 28 19 13

Hmm, it looks like some of the numeric columns (Value and the four confidence bounds) are being read in as objects. I'll clean those up in the Data Cleaning section.
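
One way to see which entries keep a column from parsing as numeric is to coerce it and look at what falls out. The sketch below runs on an invented toy Series standing in for a column such as `Value`; the strings are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for an object-typed numeric column; values are invented.
values = pd.Series(['14.5', '1,288', '21.1', '3,150.2'])

# errors='coerce' turns anything unparseable into NaN instead of raising.
parsed = pd.to_numeric(values, errors='coerce')

# Entries that were present but failed to parse are the ones to clean.
bad = values[parsed.isnull() & values.notnull()]
print(bad.tolist())
```

Here the entries with thousands separators are the ones flagged, which is the kind of clue that points to the cleanup needed.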

In [9]:
health_df.describe(include=['int64'])
Out[9]:
Year
count 18329.000000
mean 2012.393529
std 1.332788
min 2010.000000
25% 2011.000000
50% 2012.000000
75% 2013.000000
max 2016.000000
In [10]:
health_df.head()
Out[10]:
Indicator Category Indicator Shortened Indicator Name Shortened Indicator Name (Graph) Year Sex Race/Ethnicity Value Place BCHC Requested Methodology Source Methods Notes 90% Confidence Level - Low 90% Confidence Level - High 95% Confidence Level - Low 95% Confidence Level - High
0 Behavioral Health/Substance Abuse Percent of Adults Who Binge Drank Adult Binge Drinking NaN 2010 Both All 14.5 Baltimore, MD BRFSS (or similar) How many times during the ... CDC BRFSS The three most recent years of available data ... Due to changes in BRFSS sampling methodology, ... NaN NaN NaN NaN
1 Behavioral Health/Substance Abuse Percent of Adults Who Binge Drank Adult Binge Drinking NaN 2010 Both Black 9.5 Baltimore, MD BRFSS (or similar) How many times during the ... CDC BRFSS The three most recent years of available data ... Due to changes in BRFSS sampling methodology, ... NaN NaN NaN NaN
2 Behavioral Health/Substance Abuse Percent of Adults Who Binge Drank Adult Binge Drinking NaN 2010 Both White 21.1 Baltimore, MD BRFSS (or similar) How many times during the ... CDC BRFSS The three most recent years of available data ... Due to changes in BRFSS sampling methodology, ... NaN NaN NaN NaN
3 Behavioral Health/Substance Abuse Percent of Adults Who Binge Drank Adult Binge Drinking NaN 2010 Female All 9.7 Baltimore, MD BRFSS (or similar) How many times during the ... CDC BRFSS The three most recent years of available data ... Due to changes in BRFSS sampling methodology, ... NaN NaN NaN NaN
4 Behavioral Health/Substance Abuse Percent of Adults Who Binge Drank Adult Binge Drinking NaN 2010 Male All 20.3 Baltimore, MD BRFSS (or similar) How many times during the ... CDC BRFSS The three most recent years of available data ... Due to changes in BRFSS sampling methodology, ... NaN NaN NaN NaN
In [11]:
health_df.tail()
Out[11]:
Indicator Category Indicator Shortened Indicator Name Shortened Indicator Name (Graph) Year Sex Race/Ethnicity Value Place BCHC Requested Methodology Source Methods Notes 90% Confidence Level - Low 90% Confidence Level - High 95% Confidence Level - Low 95% Confidence Level - High
18324 Maternal and Child Health Percent of Mothers Under Age 20 Teen Mothers NaN 2012 Female All 8.5 Washington, DC Percentage of mothers giving birth under 20 ye... NaN NaN NaN NaN NaN NaN NaN
18325 Maternal and Child Health Percent of Mothers Under Age 20 Teen Mothers NaN 2012 Female Asian/PI 0.7 Washington, DC Percentage of mothers giving birth under 20 ye... NaN NaN NaN NaN NaN NaN NaN
18326 Maternal and Child Health Percent of Mothers Under Age 20 Teen Mothers NaN 2012 Female Black 14.1 Washington, DC Percentage of mothers giving birth under 20 ye... NaN NaN NaN NaN NaN NaN NaN
18327 Maternal and Child Health Percent of Mothers Under Age 20 Teen Mothers NaN 2012 Female Hispanic 8.0 Washington, DC Percentage of mothers giving birth under 20 ye... NaN NaN NaN NaN NaN NaN NaN
18328 Maternal and Child Health Percent of Mothers Under Age 20 Teen Mothers NaN 2012 Female White 0.5 Washington, DC Percentage of mothers giving birth under 20 ye... NaN NaN NaN NaN NaN NaN NaN
In [12]:
len(health_df[health_df.isnull().any(axis=1)])
Out[12]:
18329

Looks like all 18,329 rows have at least one column with a missing value.
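
A per-column view is often more telling than the row count. The following is a small sketch on an invented toy frame (the column names echo the dataset, the values do not) showing the share of missing values per column next to the row-level count used above.

```python
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    'Value': [14.5, 9.5, 21.1, 9.7],
    'Source': ['CDC', None, 'CDC', None],
    'Notes': [None, None, None, 'x'],
})

# Fraction of missing entries in each column, worst first.
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Number of rows with at least one missing entry (same idea as the cell above).
rows_with_gap = toy.isnull().any(axis=1).sum()

print(missing_share)
print(rows_with_gap)
```

A single mostly-empty column is enough to make every row count as incomplete, so the per-column shares show where the gaps actually are.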

Missing Data

In [19]:
msno.matrix(health_df)
In [20]:
msno.bar(health_df)
In [21]:
msno.heatmap(health_df)

Indicator Counts

In [22]:
indicator_categories = health_df['Indicator Category'].value_counts()
x = indicator_categories.index
y = indicator_categories.values
plt.figure(figsize=(12,8))
plt.bar(range(len(x)), y)
plt.xticks(range(len(x)), x, rotation=60, ha='right')
plt.title('Indicator Category')
plt.show()
In [23]:
categories = set(health_df['Indicator Category'])

for category in categories:
    category_counts = health_df[health_df['Indicator Category'] == category]['Indicator'].value_counts()
    x = category_counts.index
    y = category_counts.values
    plt.figure(figsize=(12,8))
    plt.bar(range(len(x)), y)
    plt.xticks(range(len(x)), x, rotation=60, ha='right')
    plt.title(category)
    plt.show()
    

Data Cleaning

Remove Extra Whitespace

In [13]:
object_cols = health_df.select_dtypes(include=['object']).columns

for object_col in object_cols:
    health_df[object_col] = health_df[object_col].str.rstrip()

Convert Year to Date Object

In [14]:
health_df.Year = pd.to_datetime(health_df.Year, format='%Y')

Convert Values to Floats

In [15]:
float_cols = [
    'Value', 
    '90% Confidence Level - Low', 
    '90% Confidence Level - High',
    '95% Confidence Level - Low',
    '95% Confidence Level - High'
]
In [16]:
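for float_col in float_cols:
    # Strip thousands separators so values such as '1,288' can be parsed
    health_df[float_col] = health_df[float_col].str.replace(',', '')
    # Strip the non-breaking space sequence (UTF-8 bytes C2 A0) found in some values
    health_df[float_col] = health_df[float_col].str.replace('\xc2\xa0', '')
    health_df[float_col] = pd.to_numeric(health_df[float_col])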
In [17]:
health_df.describe()
Out[17]:
Value 90% Confidence Level - Low 90% Confidence Level - High 95% Confidence Level - Low 95% Confidence Level - High
count 1.832900e+04 2.359000e+03 2.359000e+03 2264.000000 2264.000000
mean 5.881867e+04 2.144293e+04 2.146457e+04 58.185689 76.677871
std 4.049222e+06 2.095064e+05 2.096142e+05 141.721620 176.471644
min 0.000000e+00 -9.000000e-01 2.000000e-01 -0.100000 0.100000
25% 7.000000e+00 6.800000e+00 9.500000e+00 6.200000 11.975000
50% 1.590000e+01 1.190000e+01 1.620000e+01 14.950000 24.750000
75% 4.130000e+01 2.710000e+01 3.355000e+01 37.900000 54.525000
max 3.188571e+08 3.928733e+06 3.928921e+06 1558.400000 1996.500000

Minor Annoyance: Rename Place to Location

In [18]:
health_df.rename(columns ={'Place': 'Location'}, inplace=True)

Reshaping the Data

It's nice that the data started off in tidy (long) format, but that makes it harder to compare indicators using multivariate analyses. We'll have to reshape the data so that each indicator has its own column. This will also give us a better understanding of how sparse the dataset is.
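
To make the long-to-wide step concrete, here is a toy sketch of the same kind of `pivot_table` call. The locations, indicator names, and values are invented for illustration.

```python
import pandas as pd

# Long format: one row per (location, year, indicator) measurement.
long_df = pd.DataFrame({
    'Location': ['A', 'A', 'B'],
    'Year': [2010, 2010, 2010],
    'Indicator': ['Obesity', 'Smoking', 'Obesity'],
    'Value': [29.2, 24.3, 25.4],
})

# Wide format: one row per (location, year), one column per indicator.
wide_df = long_df.pivot_table(
    index=['Location', 'Year'],
    columns='Indicator',
    values='Value'
).reset_index()
wide_df.columns.name = None

print(wide_df)
```

Location B has no Smoking measurement, so that cell comes out as NaN, which is where the sparseness shows up. Note that `pivot_table` averages duplicate entries by default (`aggfunc='mean'`), so repeated measurements for the same key would be silently combined.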

In [37]:
health_reshaped = health_df.pivot_table(
    index=['Location', 'Year', 'Sex', 'Race/Ethnicity'],
    columns='Indicator',
    values='Value'
)

health_reshaped.reset_index(inplace=True)
health_reshaped.columns.name = None
health_reshaped.loc[:,'City'] = health_reshaped['Location'].str.split(', ').str.get(0)
health_reshaped.loc[:,'State'] = health_reshaped['Location']
health_reshaped.loc[health_reshaped['State']!='U.S. Total','State'] = health_reshaped['Location'].str.split(', ').str.get(1)
del health_reshaped['Location']
cols = health_reshaped.columns.tolist()
col_order = cols[-2:] + cols[:-2]
health_reshaped = health_reshaped[col_order]
health_reshaped.head()
Out[37]:
City State Year Sex Race/Ethnicity AIDS Diagnoses Rate (Per 100,000 people) All Types of Cancer Mortality Rate (Age-Adjusted; Per 100,000 people) All-Cause Mortality Rate (Age-Adjusted; Per 100,000 people) Asthma Emergency Department Visit Rate (Age-Adjusted; Per 10,000) Diabetes Mortality Rate (Age-Adjusted; Per 100,000 people) Female Breast Cancer Mortality Rate (Age-Adjusted; Per 100,000 people) Firearm-Related Emergency Department Visit Rate (Age-Adjusted; Per 10,000 people) Firearm-Related Mortality Rate (Age-Adjusted; Per 100,000 people) HIV Diagnoses Rate (Per 100,000 people) HIV-Related Mortality Rate (Age-Adjusted; Per 100,000 people) Heart Disease Mortality Rate (Age-Adjusted; Per 100,000 people) Homicide Rate (Age-Adjusted; Per 100,000 people) Infant Mortality Rate (Per 1,000 live births) Life Expectancy at Birth (Years) Lung Cancer Mortality Rate (Age-Adjusted; Per 100,000 people) Median Household Income (Dollars) Motor Vehicle Mortality Rate (Age-Adjusted; Per 100,000 people) Opioid-Related Unintentional Drug Overdose Mortality Rate (Age-Adjusted; Per 100,000 people) Percent Foreign Born Percent Living Below 200% Poverty Level Percent Unemployed Percent Who Only Speak English at Home Percent Who Speak Spanish at Home Percent of 3 and 4 Year Olds Currently Enrolled in Preschool Percent of Adults 65 and Over Who Received Pneumonia Vaccine Percent of Adults Who Are Obese Percent of Adults Who Binge Drank Percent of Adults Who Currently Smoke Percent of Adults Who Meet CDC-Recommended Physical Activity Levels Percent of Adults Who Received Seasonal Flu Shot Percent of Children (Tested) Under Age 6 with Elevated Blood Lead Levels Percent of Children Living in Poverty Percent of Children Who Received Seasonal Flu Shot Percent of High School Graduates (Over Age 18) Percent of High School Students Who Are Obese Percent of High School Students Who Binge Drank Percent of High School Students Who Currently Smoke Percent of High School Students Who Meet CDC-Recommended Physical Activity Levels Percent of Households Whose Housing Costs Exceed 35% of Income Percent of Low Birth Weight Babies Born Percent of Mothers Under Age 20 Percent of Population 65 and Over Percent of Population Under 18 Percent of Population Uninsured Persons Living with HIV/AIDS Rate (Per 100,000 people) Pneumonia and Influenza Mortality Rate (Age-Adjusted; Per 100,000 people) Race/Ethnicity (Percent) Rate of Laboratory Confirmed Infections Caused by Salmonella (Per 100,000 people) Rate of Laboratory Confirmed Infections Caused by Shiga Toxin-Producing E-Coli (Per 100,000 people) Sex (Percent) Suicide Rate (Age-Adjusted; Per 100,000 people) Total population (People) Tuberculosis Incidence Rate (Per 100,000 people)
0 Baltimore MD 2010-01-01 Both All 57.9 NaN NaN NaN NaN NaN NaN NaN 77.6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 54.6 29.2 14.5 24.3 68.4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2495.7 NaN NaN NaN NaN NaN NaN NaN NaN
1 Baltimore MD 2010-01-01 Both Asian/PI NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 174.5 NaN NaN NaN NaN NaN NaN NaN NaN
2 Baltimore MD 2010-01-01 Both Black 78.5 NaN NaN NaN NaN NaN NaN NaN 108.3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 47.6 32.8 9.5 29.4 62.2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3449.0 NaN NaN NaN NaN NaN NaN NaN NaN
3 Baltimore MD 2010-01-01 Both Hispanic 44.4 NaN NaN NaN NaN NaN NaN NaN 59.1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1232.3 NaN NaN NaN NaN NaN NaN NaN NaN
4 Baltimore MD 2010-01-01 Both White 15.8 NaN NaN NaN NaN NaN NaN NaN 19.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 62.6 25.4 21.1 19.4 75.5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 859.8 NaN NaN NaN NaN NaN NaN NaN NaN

As expected, there are a lot of null values throughout. Each study was carried out by a different agency and the results were pieced together, so while there may be some overlap between indicators, it's not guaranteed. This will make modeling difficult, so we may have to stick with visualizations.
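
How much overlap there is between any two indicators can be counted directly. The sketch below, on an invented toy frame with hypothetical column names, builds a matrix whose entry (i, j) is the number of rows where both column i and column j are present.

```python
import numpy as np
import pandas as pd

# Toy wide table; values are invented for illustration.
toy = pd.DataFrame({
    'obesity': [29.2, np.nan, 25.4, 31.0],
    'smoking': [24.3, 19.4, np.nan, 22.0],
    'income': [np.nan, np.nan, 50000.0, np.nan],
})

# 1 where a value is present, 0 where it is missing.
present = toy.notnull().astype(int)

# (columns x rows) dot (rows x columns): counts of jointly present rows.
overlap = present.T.dot(present)

print(overlap)
```

The diagonal gives each column's own non-null count, and off-diagonal zeros mark pairs of indicators that can never be compared row by row.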

In [38]:
health_reshaped.to_csv('data/bhmi_reshaped.csv')

Visualizations

  1. Correlation matrix
  2. Bar plots
  3. Histograms
  4. Violin plots
  5. Wordcloud
  6. Scatterplots colored by label
    1. Add scatterplot with histograms on each axis
  7. Line plot for time data
  8. Choropleth for location data
  9. 2d graph from Uber data
  10. Graph / tree
  11. New graph from kaggle

Get indicator name and category value counts.

Reshaped Missing Data

In [20]:
msno.matrix(health_reshaped)
In [21]:
msno.bar(health_reshaped)
In [34]:
msno.heatmap(health_reshaped)

Look at violin plots for chronic disease indicators by location.

Compare locations based on:

  1. All types of cancer rate
  2. AIDS diagnoses rate
  3. People living with HIV rate
  4. Percent of adults who are obese
  5. Percent of adults who meet activity levels

Need to drop outliers before creating correlation matrix. For loop?
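
A for loop may not be needed. One possible approach, sketched here on invented toy data, masks outliers in every column at once using the 1.5 x IQR rule. That rule is an assumption chosen for illustration, not a threshold this notebook has settled on, and on the real table the non-numeric columns would have to be set aside first.

```python
import numpy as np
import pandas as pd

# Toy numeric frame; column 'a' contains one extreme value.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 500.0],
    'b': [10.0, 11.0, 12.0, 13.0, 14.0],
})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Column-wise comparison against each column's own bounds.
inside = toy.ge(lower, axis='columns') & toy.le(upper, axis='columns')

# Values outside the bounds become NaN; corr() then ignores them pairwise.
trimmed = toy.where(inside)
corr = trimmed.corr()

print(trimmed)
print(corr)
```

Masking individual cells rather than dropping whole rows keeps the other indicators in that row available, which matters in a table this sparse.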

In [25]:
health_corr_mat = health_reshaped[health_reshaped['Rate of Laboratory Confirmed Infections Caused by Shiga Toxin-Producing E-Coli (Per 100,000 people)']<200].corr()
plt.figure(figsize=(12,8))
sns.heatmap(health_corr_mat)
plt.show()

Kendall's tau seems to be more robust to outliers than the default Pearson correlation, so no rows are filtered out here.
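
A toy comparison illustrates the point. The data below are invented: five points follow a decreasing trend and a sixth is an extreme outlier. To keep the sketch self-contained, Kendall's tau is hand-rolled as tau-a (`kendall_tau_a` is a hypothetical helper) rather than taken from the pandas call used in this notebook.

```python
from itertools import combinations

import numpy as np


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs / total pairs."""
    pairs = list(combinations(range(len(x)), 2))
    # +1 for a concordant pair, -1 for a discordant pair, 0 for a tie
    score = sum(np.sign(x[i] - x[j]) * np.sign(y[i] - y[j]) for i, j in pairs)
    return score / len(pairs)


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([6.0, 5.0, 4.0, 3.0, 2.0, 100.0])  # last point is the outlier

pearson = np.corrcoef(x, y)[0, 1]
kendall = kendall_tau_a(x, y)

print(pearson, kendall)
```

The single outlier is enough to make Pearson's r positive, while Kendall's tau stays negative because it only counts the ordering of pairs, not the size of the differences.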

In [26]:
health_corr_mat = health_reshaped.corr(method='kendall')
plt.figure(figsize=(12,8))
sns.heatmap(health_corr_mat)
plt.show()
In [68]:
axs = pd.plotting.scatter_matrix(health_reshaped, figsize=(20,16))
n = axs.shape[0]  # one row and column of axes per numeric column (53 indicators here)
for x in range(n):
    for y in range(n):
        # to get the axis of subplots
        ax = axs[x, y]
        # to make x axis name vertical  
        ax.xaxis.label.set_rotation(90)
        # to make y axis name horizontal 
        ax.yaxis.label.set_rotation(0)
        # to make sure y axis names are outside the plot area
        ax.yaxis.label.set_ha('right')

Bivariate Associations

The goal's pretty simple: investigate any relationships with strong positive or negative correlations.
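
Rather than reading candidate pairs off the heatmap by eye, the strongest correlations can be pulled out of the matrix directly. This is a sketch on invented toy data with placeholder column names; on the real table the same steps would be applied to the correlation matrix computed earlier.

```python
import numpy as np
import pandas as pd

# Toy frame: b rises with a, c falls with a, d is unrelated to a.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.1, 5.9, 8.2, 9.9],
    'c': [5.0, 3.9, 3.1, 2.0, 1.2],
    'd': [1.0, -1.0, 1.0, -1.0, 1.0],
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Order by absolute correlation so strong negatives rank alongside strong positives.
order = pairs.abs().sort_values(ascending=False).index
strongest = pairs.reindex(order)

print(strongest)
```

Each entry is indexed by a pair of column names, so the top rows can be fed straight into `sns.lmplot` as the `x` and `y` arguments.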

In [27]:
sns.lmplot(
    x='Percent of Adults Who Received Seasonal Flu Shot',
    y='Percent of High School Graduates (Over Age 18)',
    data=health_reshaped,
    size=8,
    line_kws={'color': 'red'},
    scatter_kws={'color': 'red'}
)
plt.show()
In [26]:
sns.lmplot(
    x='Percent of Households Whose Housing Costs Exceed 35% of Income',
    y='Percent of High School Students Who Currently Smoke',
    data=health_reshaped,
    size=8,
    line_kws={'color': 'green'},
    scatter_kws={'color': 'green'}
)
plt.show()

Chronic Disease

Let's examine the distributions of chronic diseases and how they interact with other variables.

In [38]:
health_reshaped.describe()
Out[38]:
AIDS Diagnoses Rate (Per 100,000 people) All Types of Cancer Mortality Rate (Age-Adjusted; Per 100,000 people) All-Cause Mortality Rate (Age-Adjusted; Per 100,000 people) Asthma Emergency Department Visit Rate (Age-Adjusted; Per 10,000) Diabetes Mortality Rate (Age-Adjusted; Per 100,000 people) Female Breast Cancer Mortality Rate (Age-Adjusted; Per 100,000 people) Firearm-Related Emergency Department Visit Rate (Age-Adjusted; Per 10,000 people) Firearm-Related Mortality Rate (Age-Adjusted; Per 100,000 people) HIV Diagnoses Rate (Per 100,000 people) HIV-Related Mortality Rate (Age-Adjusted; Per 100,000 people) Heart Disease Mortality Rate (Age-Adjusted; Per 100,000 people) Homicide Rate (Age-Adjusted; Per 100,000 people) Infant Mortality Rate (Per 1,000 live births) Life Expectancy at Birth (Years) Lung Cancer Mortality Rate (Age-Adjusted; Per 100,000 people) Median Household Income (Dollars) Motor Vehicle Mortality Rate (Age-Adjusted; Per 100,000 people) Opioid-Related Unintentional Drug Overdose Mortality Rate (Age-Adjusted; Per 100,000 people) Percent Foreign Born Percent Living Below 200% Poverty Level Percent Unemployed Percent Who Only Speak English at Home Percent Who Speak Spanish at Home Percent of 3 and 4 Year Olds Currently Enrolled in Preschool Percent of Adults 65 and Over Who Received Pneumonia Vaccine Percent of Adults Who Are Obese Percent of Adults Who Binge Drank Percent of Adults Who Currently Smoke Percent of Adults Who Meet CDC-Recommended Physical Activity Levels Percent of Adults Who Received Seasonal Flu Shot Percent of Children (Tested) Under Age 6 with Elevated Blood Lead Levels Percent of Children Living in Poverty Percent of Children Who Received Seasonal Flu Shot Percent of High School Graduates (Over Age 18) Percent of High School Students Who Are Obese Percent of High School Students Who Binge Drank Percent of High School Students Who Currently Smoke Percent of High School Students Who Meet CDC-Recommended Physical Activity Levels 
Percent of Households Whose Housing Costs Exceed 35% of Income Percent of Low Birth Weight Babies Born Percent of Mothers Under Age 20 Percent of Population 65 and Over Percent of Population Under 18 Percent of Population Uninsured Persons Living with HIV/AIDS Rate (Per 100,000 people) Pneumonia and Influenza Mortality Rate (Age-Adjusted; Per 100,000 people) Race/Ethnicity (Percent) Rate of Laboratory Confirmed Infections Caused by Salmonella (Per 100,000 people) Rate of Laboratory Confirmed Infections Caused by Shiga Toxin-Producing E-Coli (Per 100,000 people) Sex (Percent) Suicide Rate (Age-Adjusted; Per 100,000 people) Total population (People) Tuberculosis Incidence Rate (Per 100,000 people)
count 527.000000 687.000000 517.000000 255.000000 630.000000 393.000000 194.000000 520.000000 632.000000 500.000000 669.000000 526.000000 418.000000 262.000000 619.000000 45.000000 572.000000 335.000000 45.000000 48.000000 631.00000 78.000000 78.000000 78.000000 241.000000 378.000000 350.000000 375.000000 189.000000 294.000000 166.000000 46.000000 135.000000 45.000000 202.000000 214.000000 204.000000 219.000000 78.000000 550.000000 569.000000 78.000000 78.000000 706.000000 625.000000 569.000000 468.000000 503.000000 355.000000 156.000000 585.000000 7.800000e+01 565.000000
mean 23.403321 154.158297 690.400677 98.463922 24.538413 19.951654 4.777835 10.923846 32.544778 6.376400 156.256129 10.613118 6.338517 79.754198 37.504766 49870.533333 7.411538 5.571642 21.044444 43.020833 10.52187 65.926923 21.788462 50.364103 57.379253 25.479365 20.678000 18.122800 45.097884 38.064966 4.780422 31.441304 42.060741 82.846667 12.782921 13.626168 9.864216 22.468037 33.178205 8.603909 7.897452 11.520513 22.238462 16.819263 823.648400 15.568893 16.589957 15.880318 2.403380 50.000000 10.069744 1.377479e+07 8.387965
std 51.342062 53.019138 262.317598 179.942567 14.504066 17.044399 16.245432 11.371765 36.695355 7.304881 71.919055 12.805201 3.562505 4.379581 20.356937 13808.332505 6.483494 6.008951 11.430473 9.547930 4.98356 16.135720 14.248764 12.035888 19.386157 8.417198 9.538653 7.041825 16.758139 11.377420 4.901755 11.160000 15.727360 4.864042 5.368382 7.662405 4.728224 7.794551 5.152799 2.873348 7.103914 1.554830 3.479074 8.093542 784.013531 14.109223 18.469194 34.310724 13.876729 1.489923 7.089926 6.092072e+07 8.888781
min 0.000000 0.000000 109.400000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 68.000000 0.000000 23600.000000 0.000000 0.000000 4.300000 25.500000 1.70000 27.700000 3.600000 28.200000 0.100000 3.600000 2.100000 1.500000 10.400000 2.200000 0.000000 14.800000 2.700000 74.800000 0.100000 1.800000 2.200000 5.700000 23.600000 0.000000 0.000000 8.100000 13.400000 1.900000 2.900000 0.000000 0.100000 0.000000 0.000000 47.100000 0.000000 3.895240e+05 0.000000
25% 7.100000 122.100000 520.000000 34.800000 16.625000 12.600000 0.500000 3.675000 12.275000 2.100000 111.300000 2.700000 3.900000 77.000000 26.100000 42266.000000 3.900000 2.200000 14.200000 38.525000 6.85000 55.575000 8.650000 41.425000 50.200000 20.300000 14.500000 13.200000 32.000000 31.800000 1.600000 24.650000 34.350000 80.200000 9.250000 7.900000 6.500000 17.000000 29.300000 6.800000 3.200000 10.500000 21.025000 10.800000 276.700000 10.400000 2.000000 7.400000 0.500000 49.200000 5.100000 6.343340e+05 3.600000
50% 12.900000 155.600000 670.500000 60.900000 21.900000 19.800000 1.400000 7.800000 22.950000 3.800000 146.800000 5.700000 5.300000 79.600000 36.500000 47604.000000 6.150000 4.300000 19.800000 41.700000 9.40000 63.750000 21.550000 49.400000 61.400000 25.300000 19.000000 17.400000 46.900000 38.250000 3.150000 31.400000 42.200000 82.200000 13.600000 11.600000 8.650000 21.300000 33.000000 8.000000 6.300000 11.500000 22.550000 15.300000 592.000000 14.400000 8.700000 11.900000 1.000000 50.000000 8.400000 1.199495e+06 5.900000
75% 24.500000 184.450000 810.500000 113.700000 28.475000 24.400000 3.900000 14.025000 41.000000 8.325000 192.700000 13.450000 8.000000 82.575000 47.300000 53583.000000 9.700000 7.050000 27.700000 46.950000 12.85000 78.750000 30.550000 57.900000 69.300000 31.475000 25.000000 21.400000 57.200000 43.575000 6.625000 35.775000 53.650000 86.400000 16.000000 17.575000 12.525000 26.550000 36.075000 10.000000 11.000000 12.375000 24.575000 21.375000 1056.000000 18.200000 28.550000 16.900000 2.000000 50.800000 13.800000 2.473709e+06 9.800000
max 736.600000 453.500000 3594.600000 2034.900000 170.300000 272.400000 173.200000 89.600000 484.300000 69.200000 549.200000 86.900000 21.000000 96.800000 316.900000 80977.000000 74.400000 69.400000 51.300000 65.300000 30.00000 91.300000 64.000000 86.400000 92.100000 51.000000 87.500000 45.000000 77.600000 76.100000 36.400000 61.400000 73.000000 93.700000 29.200000 40.200000 29.000000 52.500000 46.200000 17.700000 59.500000 15.200000 27.500000 47.800000 4199.600000 263.400000 81.400000 496.700000 248.300000 52.900000 74.300000 3.188571e+08 109.200000
In [48]:
chronic_diseases = set(health_df[health_df['Indicator Category']=='Chronic Disease'].Indicator)

for chronic_disease in chronic_diseases:
    plt.figure(figsize=(12,8))
    sns.distplot(health_reshaped[chronic_disease].dropna(), bins=30)
    plt.show()
In [24]:
plt.figure(figsize=(12,8))
sns.violinplot(
    data=health_reshaped,
    x='Race/Ethnicity',
    y='Percent of High School Students Who Currently Smoke'
)
plt.xticks(rotation=60, ha='right')
plt.show()

Behavioral/Substance Abuse

In [44]:
plt.figure(figsize=(12,8))
sns.violinplot(
    x='Race/Ethnicity',
    y='Percent of Adults Who Binge Drank',
    data=health_reshaped[health_reshaped['Race/Ethnicity']!='All'],
)
plt.xticks(rotation=60, ha='right')
plt.show()

Life Expectancy/Death Rate

In [39]:
# lmplot creates its own figure, so no plt.figure() call is needed here
sns.lmplot(
    x='Firearm-Related Mortality Rate (Age-Adjusted; Per 100,000 people)',
    y='Homicide Rate (Age-Adjusted; Per 100,000 people)',
    data=health_reshaped[health_reshaped['Race/Ethnicity']!='All'],
    hue='State',
    size=8,
    ci=None,
    fit_reg=False
)
plt.show()
In [36]:
sns.lmplot(
    x='Median Household Income (Dollars)',
    y='Life Expectancy at Birth (Years)',
    data=health_reshaped,
    size=8,
)
plt.show()
In [23]:
plt.figure(figsize=(12,8))
sns.violinplot(
    data=health_reshaped,
    x='Race/Ethnicity',
    y='Life Expectancy at Birth (Years)'
)
plt.xticks(rotation=60, ha='right')
plt.show()
In [29]:
plt.figure(figsize=(12,8))
sns.violinplot(
    data=health_reshaped[(health_reshaped['Sex']!='Both') & (health_reshaped['Race/Ethnicity']=='All')],
    x='Race/Ethnicity',
    y='Life Expectancy at Birth (Years)',
    hue='Sex',
    split=True
)
plt.xticks(rotation=60, ha='right')
plt.show()